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Abstract. We give an algorithm for reversion of formal power series, based 
on an efficient way to evaluate the Lagrange inversion formula. Our algorithm 
requires 0{n 1 / 2 (M(n) + MM(n 1 / 2 ))) operations where M(n) and MM(n) are 
the costs of polynomial and matrix multiplication respectively. This matches 
an algorithm of Brent and Kung, but we achieve a constant factor speedup 
whose magnitude depends on the polynomial and matrix multiplication algo- 
rithms used. Benchmarks confirm that the algorithm performs well in practice. 



1. Introduction 

Classical algorithms for composition and reversion of power series truncated to 
length n require 0(n 3 ) operations on elements in the coefficient ring [S]. This can 
be improved to 0(nM(n)) where M{n) is the cost of multiplying two length-n 
polynomials. In [3J, Brent and Kung gave two asymptotically faster algorithms 
for composition, and observed that any algorithm for composition can be used for 
reversion (and vice versa) via Newton iteration, with at most a constant factor 
slowdown. 

The first algorithm (BK 2.1) requires 0(n 1/2 (M(n) + MM(n 1//2 ))) operations 
where MM(n) is the complexity of multiplying two nx n matrices. This reduces to 
(9(n 1 / 2 M(n) + n 2 ) with classical matrix multiplication, 0(n 1 / 2 M(n) + n 191 ) with 
the Strassen algorithm, and 0(n 1 / 2 M(n) 4-rt 1688 ) with the Coppersmith- Winograd 
algorithm [IT]. The last term has subsequently been improved to 0(n 1667 ) by 
Huang and Pan [8] using improved techniques for multiplication of nonsquare ma- 
trices. 

The second algorithm (BK 2.2) requires 0((n log n) 1 ^ 2 M(n)) operations. This 
is asymptotically slower than BK 2.1 when classical (M(n) = 0(n 2 )) or Karatsuba 
multiplication (M(n) = O(n los ^ 3 ) = 0(n 1585 )) is used, but faster when FFT 
polynomial multiplication (M(n) = 0(n log 1+ °^ n)) is available. 

As noted by Brent and Kung, many special compositions, including the evalua- 
tion of reciprocals, square roots, and elementary transcendental functions of power 
series, can be done in just M(n) operations. Recent research has focused on speed- 
ing up such evaluations by constant factors by eliminating redundancy from New- 
ton iteration [I] [5j [6] . Improved composition algorithms over special rings have also 
been investigated pjj[7]- However, the algorithms of Brent and Kung have remained 
the best known for composition and reversion in the general case. 

In this paper, we give a new algorithm for reversion analogous to BK 2.1 and 
likewise requiring 0(n 1 ' 2 (M (n) + MM (n 1 ' 2 ))) operations, but achieving a constant 
factor speedup. The speedup ratio depends on the asymptotics of M(n) and MM(n) 
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and is in the range between 1.2 and 2.6 with polynomial and matrix multiplication 
algorithms used in practice. Our algorithm also allows incorporating the complexity 
refinement of Huang and Pan, although the constant-factor improvement in this 
case becomes conjectural. 

Whereas BK 2.1 can be viewed as a baby-step giant-step version of Horner's 
rule, our algorithm can be viewed as a baby-step giant-step version of the Lagrange 
inversion formula, avoiding Newton iteration entirely (apart from a single 0(M(n)) 
reciprocal computation) . It is somewhat surprising that such an algorithm has been 
overlooked until now, with all publications following Brent and Kung apparently 
having taken Newton iteration as the final word on the subject matter. 

2. The algorithm 

Our setting is the ring of truncated power series i?[[x]]/ (x n ) over a commutative 
coefficient ring R in which the integers 1, . . . , n— 1 are cancellable. For example, we 
may take R = Z or R = Wi/pL with prime p > n. We recall the Lagrange inversion 
formula. If f(x) = x/h(x) where h(0) is a unit in R, then the compositional inverse 
or reversion f^ 1 (x) satisfying f(f~ 1 (x)) = f~ 1 (f(x)) = x exists and its coefficients 
are given by 

[x k ]r\x) = hx^hixf. 
k 

The straightforward way to evaluate n terms of f~ 1 (x) with the Lagrange in- 
version formula is to compute h(x) (this requires 0{M{n)) operations with New- 
ton iteration) and then compute the powers h 2 ,h 3 ,. . . successively, for a total of 
(n + 0(l))M(n) operations. 



Algorithm 1 Fast Lagrange inversion 

Input: / = a\x + a 2 x 2 + . . . + a„_ix™ _1 where n > 1 and a\ is a unit in R 
Output: g = b\x + . . . + b n ^ix n ^ 1 such that f(g(x)) = g{f(x)) = x mod x" 
m <- [y/n\ 
h <- x/f mod x"" 1 
for 1 < i < m do 

h l+1 <- h l x h mod x 11 - 1 
bi^Wx 1 - 1 }^ 
end for 
t <- h m 

for i = m, 2m, 3m, ... ,1m < n do 

h <- \[x^}t 

for 1 < j < m while i + j < n do 

b i+j <- j^nti-'ax^t) ■ (ix^-^w) 

end for 

t <— t x h m mod x n ^ 1 
end for 

return b\ + b 2 x + . . . + fe„_ix™ _1 



Our observation is that it is redundant to compute all the powers of h given that 
we only are interested in a single coefficient from each. To do better, we choose some 
1 < m < n and precompute h, h 2 , h 3 , . . . , h m . For < k < n, we can then write 
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h k as K t+ ^ where < j < m and i = lm for some < I < \n/m~\, only requiring 
h m , h 2m , h 3m , ... to be computed subsequently. Determining a single coefficient in 
h k = h l h? can then be done in 0(n) operations using the definition of the Cauchy 
product. Picking m « n 1 / 2 minimizes the number of polynomial multiplications 
required. 

We give a detailed account of this procedure as Algorithm [TJ We note that most 
of the polynomial arithmetic is done to length n — 1 rather than length n, as the 
initial coefficient always is zero. 

An improved version. Algorithm[T]clearly requires 0(n 1 ^ 2 M(n)+n 2 ) operations 
in R, as many as BK 2.1 with classical matrix multiplication. We can improve the 
complexity by packing the inner loops into a single matrix product as shown in 
Algorithm [21 This allows us to exploit fast matrix multiplication. 



Algorithm 2 Fast Lagrange inversion, matrix version 

Input: / = a\x + CL2X 2 + . . . + a n ^.\x n ~ x where n > 1 and a\ is a unit in R 
Output: g = b\x + . . . + b n ^ix n ~ 1 such that f(g(x)) = g(f(x)) — x mod x" 

m <— \\/n — 1] 

h<— x/f mod x"" 1 

Assemble m x m 2 matrices B and A from h,h 2 , . . . ,h m and h m , h 2m , h 3m . 
for 1 < i < m, 1 < j < m 2 do 

Bij <- [ x i+3-m-l} 

A hj <- [x" n - j ] h^-^ m 
end for 
C <— AB T 
for 1 < i < n do 

bi <— Ci/i (Ci is the ith entry of C read rowwise) 
end for 

return b\ + b2X + . . . + 6„_ix Tl_1 



In the description of Algorithm [21 the matrices are indexed from 1 and the 
pseudocode has been simplified by letting the exponents run out of bounds, using 
the convention that [x k ]p = when fe<0orfc>n — 1. To see that the algorithm 
is correct, write [^i+fe-iV-ij^i+fe-i)™ as 



ii+(«2— l)m— 1 



^ii+(i2— l)m— 1— j 



j=0 



h (i2-l)m\ 



and shift the summation index to obtain 



([x 11 ^-™- 1 ] h h ) ([x i2m - j ] /i (l2 ~ 1)m ) 

j=m—ii +l 

which is the inner product of the nonzero part of row i\ in B with the nonzero 
part of row ii in A. 

The structure of the matrices is perhaps illustrated more clearly by an example. 
Taking n — 8 and m — 3, we need the coefficients of 1, x, . . . , x 6 in powers of h. 
Letting h\ denote [x l ]h k , the matrices become 
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° 2 hi hi 0' 
| h\ hi h\ h\ hi 
(h 6 7 ) h e 6 h 6 5 hi h 6 3 hi h\ hlj 

hi h\ h\ h\ h\ h\ h\ 
B = ( hi h\ h\ hi h\ h 2 b hi (hf) 
\ hi hi hi h\ hi hi (h 3 r ) (hi), 

where entries in parentheses do not contribute to the final result and may be set 
to zero. In this example the coefficient of x 4 in ft, 5 is given by the fifth entry in C, 
namely C 2 a = h\hl + h\h\ + h\h\ + h\h\ + hlh 2 A . 

3. Complexity analysis 



We now study the complexity in some more detail. Let to = [vn--T| . Then 
Algorithm [5] involves at most: 

(1) 2m + 0(1) polynomial multiplications, each with cost M(n) 

(2) One (to x to 2 ) times (to 2 x to) matrix multiplication 

(3) 0(n) additional operations 

For comparison, BK 2.1 requires at most: 

(1) to polynomial multiplications, each with cost M(n) 

(2) One (to x m) times (m x to 2 ) matrix multiplication 

(3) to polynomial multiplications and additions, each with cost M(n) + n 

Brent and Kung break the matrix multiplication into to products of to x to 
matrices, requiring mMM(m) operations. We can do the same in Algorithm [21 
writing the product as a length-m inner product of m x to matrices. The extra 
0(n 3 / 2 ) additions in this matrix operation do not affect the complexity, but it is 
interesting to note that they match the 0(n 3 / 2 ) additions in the last polynomial 
stage of BK 2.1. To summarize, both Algorithm [2] and BK 2.1 require at most 
(2n 1 / 2 + 0(l))M(n) + n 1 l' 1 MM(n 1 l' 2 ) + 0(n 3 / 2 ) operations. 

The primary drawback of our algorithm as well as BK 2.1 is the requirement to 
store 0(n 3 / 2 ) temporary coefficients in memory, compared to 0(n log n) for BK 2.2 
and O(n) for a naive implementation of Lagrange inversion. 

Avoiding Newton iteration. In effect, we need the same number of operations 
to perform a length-n reversion with fast Lagrange inversion as to perform a length- 
n composition with BK 2.1. However, to perform a reversion with BK 2.1, we must 
employ Newton iteration. Using the update 

9k+i[x) = —— — -^-r— , 
f K9k{x)) 

where the chain rule allows us to reuse the composition in the numerator for 
the denominator, this entails computing a sequence of compositions of length I = 
1, . . . , [n/4] , [n/2] , n, plus a fixed number of polynomial multiplications of the 
same length at each stage. If c and r are such that a length-n composition takes 
C(n) — cn r operations, Newton iteration asymptotically takes 



C(n) + C(n/2) + C{n/A) + 



2 r 
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operations, ignoring additional polynomial multiplications. For example, with 
classical polynomial multiplication as the dominant cost (r = 5/2), the speedup 
given by the expression in parentheses is jj-(8+v2) ~ 1.214 . With FFT polynomial 
multiplication, and classical matrix multiplication as the dominant cost (r = 2), 
the speedup is 4/3. We note that a more efficient form of the Newton iteration 
might exist, in which case the speedup would be smaller. 

Improving the matrix multiplication. If the matrix multiplication dominates, 
we can gain an additional speedup from the fact that the ith to x to block of the 
matrix A only has m — i + 1 nonzero rows, whereas the matrices in BK 2.1 arc 
full. Classically this gives a twofold speedup, reflected in the loop boundaries of 
Algorithm [TJ We should ideally modify Algorithm [2] to include this saving. 

In fact, a speedup is attainable with any square matrix multiplication algorithm 
MM(m) ~ m" where uj > 2. For simplicity, assume that to is sufficiently composite. 
Do the first m/2 products as full products of size to, the next (m/2 — m/3) in blocks 
of size m/2, the next (m/3 — to/4) in blocks of size m/3, and so on. At stage k, 
only k 2 products of blocks of size m/k are required. The speedup achieved through 
this procedure is 

where the nontrivial inequality can be obtained by considering the analogous 
subdivison with blocks of size m/2 k only. 

Alternatively, we can write AB T = (AP)(P^ 1 B T ) where P is a permutation 
matrix that makes each to x to block in A lower triangular, and use any algorithm 
that speeds up multiplication between a full and a triangular matrix. A simple 
recursive decomposition of size- A; blocks into size-fc/2 blocks has a proportional 
cost of CO) = 4C*(fc/2) + 2(k/2) u + 0(k 2 ), providing a speedup of 2"- 1 - 2. 
This is greater than 1 when lu > log 2 6 s» 2.585, and better than the first method 
when ui > 1 + log 2 (2 + y/2) rj 2.771. In particular, we recover an optimal factor- 
two speedup with classical multiplication, and a 3/2 speedup with the Strassen 
algorithm. 

Using rectangular multiplication. In the preceding analysis, we have multiplied 
tox to 2 matrices via decomposition into square blocks. Remarkably, Huang and Pan 
have shown [5] that this is not asymptotically optimal with the best presently known 
algorithms. Letting MM{x, y, z) denote the complexity of multiplying an x x y 
matrix by a y x z matrix, Huang and Pan show that MM(m, to, to 2 ) = 0(n 1667 ), 
improving on the best known bound mMM(m,m,m) — 0(n 1688 ) obtained via 
multiplication of square matrices. 

As the exponents of MM(m,m 2 ,m) and MM(m, to, to 2 ) are the same ([8 , eq. 
2.7), this improvement also applies to the matrix product in Algorithm^ We are 
unfortunately unable to claim a constant-factor speedup in this situation, although 
it can plausibly be conjectured that any multiplication algorithm admits a dual 
version with MM (to, to 2 , to) = (1 + o(l))MM(m, m, to 2 ). In any case, the improve- 
ment of Huang and Pan is currently only of theoretical interest, as the advantage 
probably only can be realized for infeasibly large matrices. 
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Table 1. Theoretical speedup of Algorithm 2 over BK 2.1 due to 
avoiding Newton iteration and exploiting the matrix structure. 



Dominant operation 


Complexity 


Newton 


Matrix 


Total 


Polynomial, classical 


0{n b ' 2 ) 


1.214 


1 


1.214 


Polynomial, Karatsuba 


( n l/2+log 2 3) 


1.308 


1 


1.308 


Matrix, classical 


0{n 2 ) 


1.333 


2.000 


2.666 


Matrix, Strassen 


0(n (l+log 2 7)/ 2) 


1.364 


1.500 


2.047 


Matrix, Cop.-Win. 


( n 1.6B8) 


1.450 


1.229 


1.782 


Matrix, Huang-Pan 


0(n lmr ) 


1.458 


1? 


1.458? 


(Polynomial, FFT) 


0(n 3 / 2 log 1+o(1) n) 


1.546 


1 


1.546 



Practical performance. Table[T]gives a summary of the speedup gained by Algo- 
rithm[5]ovcr BK 2.1 with various matrix and polynomial multiplication algorithms. 
The last row gives the speedup assuming that the cost of matrix multiplication can 
be ignored. 

With FFT-based polynomial multiplication, BK 2.2 is asymptotically faster than 
BK 2.1 and hence also than Algorithm [2] In practice, however, the overhead of 
quasilinear polynomial multiplication compared to matrix multiplication is likely 
to be large. Fast Lagrange inversion can therefore be expected to be faster than 
not only BK 2.1 but also BK 2.2 even for quite large n. 

Of course, counting ring operations may not accurately reflect actual speed since 
operations in most interesting rings take a variable amount of time to execute on a 
physical computer. One consequence of this fact is that Newton iteration is likely 
to impose a smaller overhead than predicted, since coefficients generally are smaller 
in earlier steps than in later ones. Newton iteration can also be expected to perform 
better than generically when the output as a whole has small coefficients. 

We note that fast Lagrange inversion becomes faster than generically when the 
coefficients of x/f(x) grow slowly. This is often the case when f(x) is a rational 
function. Although specialized algorithms can revert rational functions even faster, 
it is desirable for a general-purpose algorithm to be efficient in this common case, 
and Lagrange inversion of course also works for nonrational functions having this 
growth property. 

4. Benchmarks 

We have implemented tuned versions of naive Lagrange inversion ("Lagrange"), 
BK 2.1 with Newton iteration, and Algorithm [T] ( "Fast Lagrange") over Z/pZ, Z 
and Q as part of the FLINT C library [4]. For each of these rings, FLINT provides 
fast coefficient arithmetic (using MPIR [10 for bignum arithmetic) and asymptoti- 
cally fast polynomial multiplication using Kronecker segmentation. Strassen matrix 
multiplication is exploited in the implementation of BK 2.1 over Z/pZ, although 
this does not contribute appreciably for practical n. 

Timings over Z/pZ obtained on a Pentium T4400 2.2 GHz CPU with 3 GiB of 
RAM are given in Table [5] Algorithm [T] consistently runs about 1.6 times faster 
than BK 2.1, roughly agreeing with a predicted speedup of 1.546 with quasilinear 
polynomial multiplication and negligible cost of matrix multiplication. We find 
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Table 2. Timings for reversing a random power series over 
Z/pZ,p = 2 63 + 29. 



n 


Lagrange 


BK 2.1 


Fast Lagrange 


10 


12 |is 


21 U-S 


7.0 lis 


100 


3.8 ms 


1.3 ms 


0.76 ms 


1000 


950 ms 


100 ms 


62 ms 


10000 


150 s 


5.1 s 


3.3 s 


100000 


~ 6 h 


260 s 


160 s 



Table 3. Timings for reversing fi(x) — J2k>i k\x k , f2(x) — 
/ 3 (ar) = *+f a over Z. 



n 


Laj 


;range 






BK 2.1 




Fast Lagrange 




fx 


h 


fa 


/i 


h 


h 


fi 


h 


h 


10 


10 us 


10 


8.4 


16 


15 


14 


6.8 


5.7 


5.5 


50 


3.7 ms 


1.0 


0.46 


1.2 


0.45 


0.40 


1.1 


0.28 


0.14 


100 


65 ms 


10 


4.4 


12 


2.8 


2.8 


13 


1.4 


0.87 


500 


23 s 


3.0 


1.2 


2.5 


0.37 


0.36 


2.3 


0.16 


0.084 


1000 


280 s 


30 


11 


22 


2.8 


2.4 


18 


2.0 


0.54 


5000 








4100 


s 340 


230 


2400 


110 


46 



Table 4. Timings for reversing f±{x) = cxp(x) — l,/s(x) — 
xexp(x),f 6 (x) = J^^Jyi over Q. 



n 


Laf 


grange 






BK 2.1 




Fast Lag 


range 




h 


h 


k 


U 


h 


h 


h 


h 


h 


10 


22 us 


20 


19 


42 


44 


44 


15 


14 


13 


50 


5.6 ms 


5.3 


1.3 


2.4 


3.5 


2.2 


1.8 


1.6 


0.38 


100 


78 ms 


73 


12 


16 


27 


14 


17 


14 


2.1 


500 


30 s 


27 


3.1 


2.1 


3.8 


1.3 


3.3 


2.6 


0.18 


1000 


340 s 


300 


30 


17 


32 


8.8 


30 


25 


1.3 



that the computation time in BK 2.1 indeed is dominated by polynomial multipli- 
cations rather than by matrix multiplication for feasible n. For example, matrix 
multiplication only takes up 10-20% of the time at both n = 10 4 and n = 10 5 . 
This suggests (along with timings of a preliminary implementation of BK 2.2) that 
BK 2.2 would not be able to beat BK 2.1 in the tested range. 

Over Z and particularly over Q, ring operations do not take constant time and 
the actual performance becomes highly sensitive to the input. This is reflected in 
Tables |31 and SJ We observe that BK 2.1 is faster on f± (whose output coefficients 
are smaller than those of f§) while Lagrange inversion handles the rational functions 
/3 and fe substantially faster. 

Great care must be taken to handle denominators efficiently. In our implementa- 
tion of BK 2.1, we found that naive matrix multiplication over Q took ten times as 
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long as polynomial multiplications. Clearing denominators and multiplying matri- 
ces over Z resulted in a comparable time being spent on the matrix and polynomial 
stages. Similar concerns apply when implementing Algorithms [T] and [2] On the 
other hand, translating the entire series composition or reversion to one over Z by 
rescaling the inputs typically results in too much coefficient inflation, and can even 
run slower than a classical algorithm working directly over Q. We expect the situ- 
ation to be similar when working with e.g. parametric power series having rational 
functions as coefficients. 

5. Conclusion 

Fast Lagrange inversion is a practical algorithm for reversion of formal power se- 
ries, having essentially no higher overhead than a naive implementation of Lagrange 
inversion for small n and requiring fewer coefficient operations than Newton itera- 
tion coupled with BK 2.1 for large n. Among currently available general-purpose 
algorithms, it is likely to be the fastest choice for typical coefficient rings, input se- 
ries, and values of n, and may thus be a good choice as a default reversion algorithm 
in a computer algebra system. 

Newton iteration with BK 2.2 remains faster asymptotically when FFT polyno- 
mial multiplication is available, and uses less memory, but may require very large 
n to become advantageous. Possible directions for future research could be to iden- 
tify improvements over special rings or look for constant-factor improvements in 
reversion via BK 2.2. Further study of the special matrix multiplications arising in 
Algorithm [2] and BK 2.1 would also be warranted. 
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